MEDB 5501, Module05

2024-09-17

Topics to be covered

  • What you will learn
    • Interpretation of linear regression coefficients
    • Computing linear regression in R
    • The least squares principle
    • The analysis of variance table
    • Computing the analysis of variance table in R
    • Confidence interval for the slope parameter
    • Computing confidence intervals in R
    • Your homework


Quote from “Peggy Sue Got Married”

Algebra formula for a straight line

  • \(y = mx + b\)
  • \(m = \Delta y / \Delta x\)
  • \(m\) = slope
  • \(b\) = y-intercept

Linear regression interpretation of a straight line

  • The slope represents the estimated average change in Y when X increases by one unit.
  • The intercept represents the estimated average value of Y when X equals zero.
  • Terminology
    • X is the independent or predictor variable
    • Y is the dependent or outcome variable

Simple regression example with interpretation, 1

Simple regression example with interpretation, 2

Simple regression example with interpretation, 3

Simple regression example with interpretation, 4


Call:
lm(formula = age_stop ~ mom_age, data = bf)

Coefficients:
(Intercept)      mom_age  
      5.920        0.389  

Predicted values, 1

  • How long would you expect a 20-year-old mom to breastfeed?
# A tibble: 5 × 2
  mom_age age_stop
    <dbl>    <dbl>
1      20        5
2      20       29
3      20        6
4      20       NA
5      20       12

Predicted values, 2

  • For an existing value in the data, \(X_i\)
    • \(\hat{Y}_i=b_0+b_1 X_i\)
  • For a new value of X
    • \(\hat{Y}_{new}=b_0+b_1 X_{new}\)
    • Do not predict outside the range of X values

Why predict for a value you have already seen?

  • Future Y may differ from previous Y
  • \(\hat{Y}_i\) is more precise
  • Comparison of \(\hat{Y}_i\) to existing \(Y_i\).

Predicted values, 3

Predicted age_stop = 5.92 + 0.389*20 = 13.7

# A tibble: 1 × 2
  mom_age .fitted
    <dbl>   <dbl>
1      20    13.7
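The arithmetic above is easy to check by hand. A minimal Python sketch, using the rounded coefficients from the lm() output (so the last digit may differ slightly from R's full-precision fit):

```python
# Hand-check of the predicted value, using the rounded coefficients
# b0 = 5.92 and b1 = 0.389 from the lm() output earlier.
b0, b1 = 5.92, 0.389

def predict(x):
    """Predicted age_stop (months) for a mom of age x (years)."""
    return b0 + b1 * x

print(round(predict(20), 1))  # 13.7
```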

Residuals, 1

  • \(e_i=Y_i-\hat{Y}_i\)
    • Residual = Observed - Predicted
  • Very helpful in assessing assumptions

Residuals, 2

Residuals, 3

Residuals, 4

  • Votes (Buchanan) = 45.3 + 0.0049 * Votes (Bush)
    • The estimated average number of votes for Buchanan increases by 0.0049 (about 1 vote per 200) for every increase of one vote for Bush.

Residuals, 5

  • In Palm Beach County
    • Votes (Bush) = 152,846
    • Predicted Votes (Buchanan) = 797
      • 45.3 + 0.0049 * 152,846
    • Actual Votes (Buchanan) = 3,407
  • Residual = 3,407 - 797 = 2,610

Residuals, 6

# A tibble: 4 × 5
  .rownames mom_age age_stop .fitted .resid
  <chr>       <dbl>    <dbl>   <dbl>  <dbl>
1 8              20        5    13.7  -8.70
2 40             20       29    13.7  15.3 
3 44             20        6    13.7  -7.70
4 67             20       12    13.7  -1.70
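The .resid column is just Observed minus Predicted. A quick Python check of the four rows, with values copied from the augment() output above (so they carry its rounding):

```python
# Residual = observed - predicted, for the four moms aged 20 shown above.
fitted = 13.7
age_stop = [5, 29, 6, 12]
resid = [round(y - fitted, 2) for y in age_stop]
print(resid)  # [-8.7, 15.3, -7.7, -1.7]
```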

Break #1

  • What you have learned
    • Interpretation of linear regression coefficients
  • What’s coming next
    • Computing linear regression in R

Location of data dictionary and code

Break #2

  • What you have learned
    • Computing linear regression in R
  • What’s coming next
    • The least squares principle

The population model

  • \(Y_i=\beta_0+\beta_1 X_i + \epsilon_i,\ i=1,...,N\)
    • \(\epsilon_i\) is an unknown random variable
      • Mean 0, standard deviation \(\sigma\)
      • Often assumed to be normal
    • \(\beta_0\) and \(\beta_1\) are unknown parameters
    • \(b_0\) and \(b_1\) are estimates from the sample

Least squares principle, 1

  • Collect a sample
    • \((X_1,Y_1),\ (X_2,Y_2),\ ...\ (X_n,Y_n)\)
  • Compute residuals
    • \(e_i=Y_i-(b_0+b_1 X_i)\)
    • Choose \(b_0\) and \(b_1\) to minimize \(\Sigma e_i^2\)
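One way to see the principle in action is a brute-force search: try many candidate lines and keep the one with the smallest \(\Sigma e_i^2\). A Python sketch on a small made-up dataset (assumed values, not the course bf data):

```python
# Brute-force illustration of the least squares principle: scan a
# grid of candidate (b0, b1) pairs and keep the line with the
# smallest sum of squared residuals.
x = [18, 20, 22, 25, 30, 35]
y = [4, 6, 9, 10, 14, 18]

def sse(b0, b1):
    """Sum of squared residuals for the candidate line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# b0 from -20.0 to 20.0 by 0.1; b1 from 0.00 to 2.00 by 0.01
best_sse, best_b0, best_b1 = min(
    (sse(i / 10, j / 100), i / 10, j / 100)
    for i in range(-200, 201)
    for j in range(0, 201)
)
print(best_b0, best_b1)
```

Lm-style software uses the closed-form solution instead of a grid, but the criterion being minimized is the same.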

Least squares principle, 2

Least squares principle, 3

Least squares principle, 4

Least squares principle, 5

  • General solution
    • \(b_1=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)
    • \(b_0=\bar{Y}-b_1\bar{X}\)
  • Notice the similarity between \(b_1\) and r
    • \(b_1=r\frac{S_Y}{S_X}\)

Relationship to the correlation coefficient

  • Recall from the previous module
    • \(Cov(X,Y)=\frac{1}{n-1}\Sigma(X_i-\bar{X})(Y_i-\bar{Y})\)
    • \(r_{XY}=\frac{Cov(X,Y)}{S_X S_Y}\)
  • This implies that
    • \(b_1=r_{XY}\frac{S_Y}{S_X}\)

Important implications

  • \(r_{XY}\) is unitless, \(b_1\) is Y units per X units
  • \(r_{XY}>0\) implies \(b_1>0\)
  • \(r_{XY}=0\) implies \(b_1=0\)
  • \(r_{XY}<0\) implies \(b_1<0\)
    • and vice versa

Break #3

  • What you have learned
    • The least squares principle
  • What’s coming next
    • The analysis of variance table

Sum of squares regression

Sum of squares error

Sum of squares total / corrected total

Sum of squares total (uncorrected)

ANOVA table for linear regression

\[\begin{matrix} & SS & df & MS & \text{F-ratio} \\ \text{Regression} & SSR & 1 & MSR=\frac{SSR}{1} & F=\frac{MSR}{MSE} \\ \text{Error} & SSE & n-2 & MSE=\frac{SSE}{n-2} & \\ \text{Total} & SST & n-1 & & \end{matrix}\]
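The table's bookkeeping can be verified numerically: SST splits exactly into SSR + SSE, and the degrees of freedom split as 1 and n - 2. A Python sketch on a small made-up dataset (assumed values, not the course bf data):

```python
import statistics as st

# Numerical check of the ANOVA decomposition for simple regression.
x = [18, 20, 22, 25, 30, 35]
y = [4, 6, 9, 10, 14, 18]
n = len(x)

xbar, ybar = st.mean(x), st.mean(y)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * xi for xi in x]

SSR = sum((fi - ybar) ** 2 for fi in fitted)            # explained
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
SST = sum((yi - ybar) ** 2 for yi in y)                 # corrected total

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE
print(round(SST, 6), round(SSR + SSE, 6))  # the two totals agree
```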

Analysis of variance table in R

Analysis of Variance Table

Response: age_stop
          Df Sum Sq Mean Sq F value  Pr(>F)  
mom_age    1  570.0  569.99  5.7531 0.01879 *
Residuals 80 7925.9   99.07                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-squared

  • SST, total variation, is split into
    • SSR, explained variation, and
    • SSE, unexplained variation
  • \(R^2=\frac{SSR}{SST}=1-\frac{SSE}{SST}\)
    • \(0 \le R^2 \le 1\)
    • Proportion of explained variation
      • \(R^2 > 0.5\) strong
      • \(0.1 < R^2 < 0.5\) weak
  • \(R^2 = r^2\)

R-squared calculation in R

[1] 0.06708949
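The value above can also be recovered by hand from the ANOVA table shown earlier (SSR = 570.0, SSE = 7925.9):

```python
# R-squared from the sums of squares in the anova() output.
SSR, SSE = 570.0, 7925.9
SST = SSR + SSE
R2 = SSR / SST
print(round(R2, 4))  # 0.0671, matching the R output above
```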

F-ratio, 1

  • \(F = \frac{MSR}{MSE}\)

Break #4

  • What you have learned
    • The analysis of variance table
  • What’s coming next
    • Computing the analysis of variance table in R

Location of code

Break #5

  • What you have learned
    • Computing the analysis of variance table in R
  • What’s coming next
    • Confidence interval for the slope parameter

Confidence intervals, 1

  • Population model
    • \(Y_i=\beta_0+\beta_1 X_i + \epsilon_i,\ i=1,...,N\)
  • Sample estimates
    • \(b_1=\frac{\Sigma(X_i-\bar{X})(Y_i-\bar{Y})}{\Sigma(X_i-\bar{X})^2}\)
    • \(b_0=\bar{Y}-b_1\bar{X}\)

Confidence intervals, 2

  • Standard error (se)
    • \(se(b_1)=\sqrt{\frac{MSE}{(n-1) S_x^2}}\)
  • Confidence interval for \(\beta_1\)
    • \(b_1 \pm t\ se(b_1)\)
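A hand-check of this formula, using the rounded values b1 = 0.389 and se(b1) = 0.162 from the regression output in this module, together with the critical value t(0.975, df = 80) = 1.990 (df = n - 2 = 80 here). Rounding in the inputs shifts the last digit relative to confint():

```python
# 95% CI for the slope: b1 plus or minus t * se(b1).
b1, se, t = 0.389, 0.162, 1.990
lo, hi = b1 - t * se, b1 + t * se
print(round(lo, 3), round(hi, 3))  # close to confint()'s 0.066 and 0.712
```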

Confidence intervals, 3

                  2.5 %    97.5 %
(Intercept) -3.19546976 15.035265
mom_age      0.06625878  0.711827

Hypothesis test, 1

  • \(H_0:\ \beta_1=0\)
  • \(H_1:\ \beta_1 \ne 0\)
    • Accept \(H_0\) if \(T=\frac{b_1}{se(b_1)}\) is close to zero
    • or Accept \(H_0\) if the confidence interval includes zero
    • or Accept \(H_0\) if the p-value is large
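The statistic column in R's tidy() output is just the estimate divided by its standard error. A quick check with the rounded values for mom_age:

```python
# T = b1 / se(b1), using the rounded values from the tidy() output.
b1, se = 0.389, 0.162
T = b1 / se
print(round(T, 2))  # 2.4, matching the statistic column
```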

Hypothesis test, 2

# A tibble: 2 × 5
  term        estimate std.error statistic p.value
  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)    5.92      4.58       1.29  0.200 
2 mom_age        0.389     0.162      2.40  0.0188

Why two different ways to test?

  • \(F=T^2\)
    • Accept \(H_0\) if F is close to one
    • Accept \(H_0\) if T is close to zero
  • For both tests
    • Accept \(H_0\) if p-value is large
  • F and T differ in more complex settings
    • F is a global test of all variables
    • T is a series of separate tests of individual variables
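A numerical check using the two R outputs in this module: anova() gave F = 5.7531 and tidy() gave T = 2.40 for mom_age. The small discrepancy in T squared is due to rounding of the reported values:

```python
# In simple regression, F equals T^2 for the slope.
T = 2.40    # from the tidy() output
F = 5.7531  # from the anova() output
print(round(T ** 2, 1), round(F, 1))  # both 5.8
```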

Break #6

  • What you have learned
    • Confidence interval for the slope parameter
  • What’s coming next
    • Computing confidence intervals in R

Location of code

Break #7

  • What you have learned
    • Computing confidence intervals in R
  • What’s coming next
    • Your homework

Location of programming assignment

Summary

  • What you have learned
    • Interpretation of linear regression coefficients
    • Computing linear regression in R
    • The least squares principle
    • The analysis of variance table
    • Computing the analysis of variance table in R
    • Confidence interval for the slope parameter
    • Computing confidence intervals in R
    • Your homework